Abstract: We introduce a novel algorithm for solving learning
problems where both the loss function and the regularizer are
non-convex but belong to the class of difference of convex (DC)
functions. Our contribution is a new general-purpose proximal
Newton algorithm that is able to deal with such a situation.
The algorithm obtains a descent direction from an approximation
of the loss function and then performs a line search to ensure
sufficient descent. A theoretical analysis shows that the limit
points of the iterates of the proposed algorithm are stationary
points of the DC objective function. Numerical experiments show
that our approach is more efficient than the current state of the
art for a problem with a convex loss function and a non-convex
regularizer. We also illustrate the benefit of our algorithm on a
high-dimensional transductive learning problem where both the loss
function and the regularizer are non-convex.

Index Terms: Difference of convex functions, non-convex
regularization, proximal Newton, sparse logistic regression.

I. Introduction 

In many real-world application domains such as computational
biology, finance or text mining, the datasets considered for
learning prediction models are routinely large-scale and
high-dimensional, raising the issue of model complexity control.
One way of dealing with such datasets is to learn sparse models.
Hence, a very large number of recent works in machine learning,
statistics and signal processing have addressed optimization
problems related to sparsity issues.

One of the most popular algorithms for achieving sparse
models is the Lasso algorithm [1], also known as the Basis
pursuit algorithm [2] in the signal processing community.
This algorithm applies $\ell_1$-norm regularization to the
learning model. The choice of the $\ell_1$ norm comes from its
appealing properties, which are convexity, continuity and its
ability to produce sparse or even the sparsest model in some
cases owing to its non-differentiability at zero [3], [4]. Since
these seminal works, several efforts have been devoted to
the development of efficient algorithms for solving learning
problems that consider sparsity-inducing regularizers [5]-[8].
However, the $\ell_1$ regularizer presents some drawbacks such as
its inability, in certain situations, to retrieve the true relevant
variables of a model [9], [10]. Since the $\ell_1$-norm regularizer
is a continuous and convex surrogate of the $\ell_0$ pseudo-norm,
other kinds of regularizers, which abandon the convexity
property, have been analyzed by several authors and have
been proved to achieve better statistical properties. Common
non-convex and non-differentiable regularizers are the SCAD
regularizer [10], the $\ell_p$ regularizer [11], the capped-$\ell_1$
and the log penalty [12]. These regularizers have been frequently
used for feature selection or for obtaining sparse models [12]-[14].

While statistically appealing, these non-convex and non-smooth
regularizers lead to challenging optimization problems. In this
work, we propose a novel efficient non-convex proximal Newton
algorithm. Indeed, one of the most frequently used algorithms for
solving $\ell_1$-norm regularized problems is the proximal gradient
algorithm [15]. Recently, proximal Newton-type methods have been
introduced for solving composite optimization problems involving
the sum of a smooth, convex, twice differentiable function
and a non-smooth convex function (typically the regularizer)
[16], [17]. These proximal Newton algorithms have been
shown to be substantially faster than their proximal gradient
counterparts. Our objective is thus to go beyond the state of
the art by proposing an efficient proximal Newton algorithm
that is able to handle machine learning problems where the loss
function is smooth and possibly non-convex and the regularizer
is non-smooth and non-convex.

Based on this, we propose an efficient general proximal
Newton method for optimizing a composite objective function
$f(\mathbf{x}) + h(\mathbf{x})$ where both functions $f$ and $h$ can be
non-convex and belong to a large class of functions that can
be decomposed as the difference of two convex functions
(DC functions) [18]-[20]. In addition, we also allow $h(\mathbf{x})$
to be non-smooth, which is necessary for sparsity-promoting
regularization. The proposed algorithm has a wide range of
applicability that goes far beyond the handling of non-convex
regularizers. Indeed, our global framework can genuinely
deal with non-convex loss functions that usually appear in
learning problems. To make the DC proximal Newton approach
concrete, we illustrate the relevance and the effectiveness
of the novel algorithm by considering a problem of sparse
transductive logistic regression in which the regularizer as
well as the loss related to the unlabeled examples are
non-convex. To the best of our knowledge, this is the first work
that introduces such a model and proposes an algorithm for
solving the related optimization problem. In addition to this
specific problem, many non-convex optimization problems
involving non-convex loss functions and non-convex and
non-differentiable regularizers arise in machine learning, e.g.,
dictionary learning [21], [22] or matrix factorization [23] problems.
In addition, several works have recently shown that non-convex
loss functions such as the Ramp loss, which is a DC
function, lead to classifiers more robust to outliers [24], [25].
We thus believe that the proposed framework is of general
interest in machine learning optimization problems involving
this kind of losses and regularizers.

The algorithm we propose consists of two steps: first it seeks
a search direction and then it looks for a step size in that
direction that sufficiently decreases the objective value. The
main novelty of this work is that the search
direction is obtained by solving a subproblem which involves
both an approximation of the smooth loss function and the
DC regularizer. Note that while our algorithm for non-convex
objective functions is rather similar to the convex proximal
Newton method, non-convexity and non-differentiability raise
some technical issues when analysing the properties of the
algorithm. Nonetheless, we prove several properties related
to the search direction and provide a convergence analysis of
the algorithm to a stationary point of the related optimization
problem. These properties are obtained as non-trivial extensions
of the convex proximal Newton case. Experimental studies
show the benefit of the algorithm in terms of running time
while preserving or improving generalization performance
compared to existing non-convex approaches.

The paper is organized as follows. Section II introduces 
the general optimization problem we want to address as well 
as the proposed DC proximal Newton optimization scheme. 
Details on the implementation and discussion concerning 
related works are also provided. In Section III, an analysis of 
the properties of the algorithm is given. Numerical experiments 
on simulated and real-world data comparing our approach to 
the existing methods are depicted in Section IV, while Section 
V concludes the paper. 

II. DC Proximal Newton Algorithm

We are interested in solving the following optimization
problem

$$\min_{\mathbf{x}} F(\mathbf{x}) := f(\mathbf{x}) + h(\mathbf{x}) \tag{1}$$

with the following assumptions concerning the functions $f$ and
$h$. $f$ is supposed to be twice differentiable, lower bounded on
$\mathbb{R}^d$, and we suppose that there exist two convex functions $f_1$
and $f_2$ such that $f$ can be written as a difference of convex
(DC) functions $f(\mathbf{x}) = f_1(\mathbf{x}) - f_2(\mathbf{x})$. We also assume that
$f_1$ verifies the $L$-Lipschitz gradient property

$$\|\nabla f_1(\mathbf{x}) - \nabla f_1(\mathbf{y})\| \le L\|\mathbf{x} - \mathbf{y}\| \quad \forall \mathbf{x}, \mathbf{y} \in \operatorname{dom} f_1.$$

The DC assumption on $f$ is not very restrictive since any
differentiable function $f(\cdot)$ with a bounded Hessian matrix
can be expressed as a difference of convex functions [26].
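
To see why a bounded Hessian is enough, one explicit (and by no means unique) decomposition can be written down. The constant $\rho$ below is an assumed bound on the negative curvature of $f$, introduced here only for illustration; it is not a quantity used elsewhere in the paper.

```latex
% Assumption: \nabla^2 f(\mathbf{x}) \succeq -\rho\,\mathbf{I} for all \mathbf{x}, with \rho \ge 0.
f(\mathbf{x})
  = \underbrace{f(\mathbf{x}) + \tfrac{\rho}{2}\|\mathbf{x}\|_2^2}_{f_1(\mathbf{x})}
  \;-\; \underbrace{\tfrac{\rho}{2}\|\mathbf{x}\|_2^2}_{f_2(\mathbf{x})},
\qquad
\nabla^2 f_1(\mathbf{x}) = \nabla^2 f(\mathbf{x}) + \rho\,\mathbf{I} \succeq 0 .
```

Both $f_1$ and $f_2$ are then convex, and $f_1$ inherits a Lipschitz gradient whenever the Hessian of $f$ is also bounded from above.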

The function $h$ is supposed to be lower-bounded, proper and
lower semi-continuous, and its restriction to its domain is
continuous. We suppose that $h$ can also be expressed as

$$h(\mathbf{x}) = h_1(\mathbf{x}) - h_2(\mathbf{x}) \tag{2}$$

where $h_1$ and $h_2$ are both convex functions. As discussed in
the introduction, we focus our interest on situations where $h$
is non-convex and non-differentiable. As such, $h_1$ and $h_2$ are
also expected to be non-differentiable. A large class of
non-convex sparsity-inducing regularizers can be expressed as
DC functions, as discussed in [14]. This includes the classical
SCAD regularizer, the $\ell_p$ regularizer, the capped-$\ell_1$ and the
log penalty mentioned above.

Note that those assumptions on $f$ and $h$ cover a broad class
of optimization problems. The proposed approach can be applied
to sparse linear model estimation as illustrated in Section
IV. But more general learning problems such as those using
the overlapping non-convex $\ell_p - \ell_1$ (with $p < 1$) group-lasso
as used in [27] can also be considered. Our framework also
encompasses structured sparse dictionary learning or
matrix factorization [22], [28], sparse and low-rank matrix
estimation [29], [30], or maximum likelihood estimation of
graphical models [31], when the $\ell_1$ sparsity-inducing
regularizer is replaced for instance by a more aggressive
regularizer like the log penalty or the SCAD regularizer.

A. Optimization scheme 

For solving Problem (1), which is a difference of convex
functions optimization problem, we propose a novel iterative
algorithm which first looks for a search direction $\Delta\mathbf{x}$ and then
updates the current solution. Formally, the algorithm is based
on the iteration

$$\mathbf{x}_{k+1} = \mathbf{x}_k + t_k \Delta\mathbf{x}_k$$

where $t_k$ and $\Delta\mathbf{x}_k$ are respectively a step size and the search
direction. Similarly to the work of Lee et al. [16], the search
direction is computed by minimizing a local approximation
of the composite function $F(\mathbf{x})$. However, we show that by
using a simple approximation of $f_1$, $f_2$ and $h_2$, we are able to
handle the non-convexity of $F(\mathbf{x})$, resulting in an algorithm
which is wrapped around a specific proximal Newton iteration.

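For dealing with the non-convex situation, we define the
search direction as the solution of the following problem

$$\Delta\mathbf{x}_k = \arg\min_{\Delta\mathbf{x}} \; \tilde f(\mathbf{x}_k + \Delta\mathbf{x}) + \tilde h(\mathbf{x}_k + \Delta\mathbf{x}) \tag{3}$$

where $\tilde f$ and $\tilde h$ are the following approximations of
respectively $f$ and $h$ at $\mathbf{x}_k$. We define $\tilde f(\mathbf{x})$ as

$$\begin{aligned}
\tilde f(\mathbf{x}) = {} & f_1(\mathbf{x}_k) + \nabla f_1(\mathbf{x}_k)^\top (\mathbf{x} - \mathbf{x}_k) \\
& + \tfrac{1}{2} (\mathbf{x} - \mathbf{x}_k)^\top \mathbf{H}_k (\mathbf{x} - \mathbf{x}_k) \\
& - f_2(\mathbf{x}_k) - \mathbf{z}_{f_2}^\top (\mathbf{x} - \mathbf{x}_k)
\end{aligned} \tag{4}$$

where $\mathbf{z}_{f_2} = \nabla f_2(\mathbf{x}_k)$ and $\mathbf{H}_k$ is any positive definite
approximation of the Hessian matrix of $f_1$ at the current iterate.
We also consider

$$\tilde h(\mathbf{x}) = h_1(\mathbf{x}) - h_2(\mathbf{x}_k) - \mathbf{z}_{h_2}^\top (\mathbf{x} - \mathbf{x}_k) \tag{5}$$

where $\mathbf{z}_{h_2} \in \partial h_2(\mathbf{x}_k)$, with the latter being the
sub-differential of $h_2$ at $\mathbf{x}_k$.

Note that the first three summands in Equation (4) form
a quadratic approximation of $f_1(\mathbf{x})$ whereas the terms in the
third line of Equation (4) are a linear approximation of $f_2(\mathbf{x})$.
In the same spirit, $\tilde h$ is actually a majorizing function of $h$
since we have linearized the convex function $h_2$ and $h$ is a
difference of convex functions.

Algorithm 1 DC proximal Newton algorithm
1: Initialize $\mathbf{x}_0 \in \operatorname{dom} F$
2: $k = 0$
3: repeat
4: compute $\mathbf{z}_{h_2} \in \partial h_2(\mathbf{x}_k)$ and $\mathbf{z}_{f_2} = \nabla f_2(\mathbf{x}_k)$
5: update $\mathbf{H}_k$ (exactly or using a quasi-Newton approach)
6: $\mathbf{v}_k \leftarrow \nabla f_1(\mathbf{x}_k) - \mathbf{z}_{f_2} - \mathbf{z}_{h_2}$
7: $\Delta\mathbf{x}_k \leftarrow \operatorname{prox}_{h_1}^{\mathbf{H}_k}(\mathbf{x}_k - \mathbf{H}_k^{-1}\mathbf{v}_k) - \mathbf{x}_k$
8: compute the stepsize $t_k$ through backtracking starting from $t_k = 1$
9: $\mathbf{x}_{k+1} = \mathbf{x}_k + t_k \Delta\mathbf{x}_k$
10: $k \leftarrow k + 1$
11: until convergence criterion is met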

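We are now in a position to provide the proximal expression
of the search direction. Indeed, Problem (3) can be rewritten
as

$$\arg\min_{\Delta\mathbf{x}} \; \tfrac{1}{2} \Delta\mathbf{x}^\top \mathbf{H}_k \Delta\mathbf{x} + h_1(\mathbf{x}_k + \Delta\mathbf{x}) + \mathbf{v}_k^\top \Delta\mathbf{x} \tag{6}$$

with $\mathbf{v}_k = \nabla f_1(\mathbf{x}_k) - \mathbf{z}_{f_2} - \mathbf{z}_{h_2}$. After some algebra given in
the appendix and involving optimality conditions of a proximal
Newton operator, we can show that

$$\Delta\mathbf{x}_k = \operatorname{prox}_{h_1}^{\mathbf{H}_k}(\mathbf{x}_k - \mathbf{H}_k^{-1}\mathbf{v}_k) - \mathbf{x}_k \tag{7}$$

with by definition [15], [16]

$$\operatorname{prox}_{h_1}^{\mathbf{H}}(\mathbf{x}) = \arg\min_{\mathbf{y}} \; \tfrac{1}{2}\|\mathbf{x} - \mathbf{y}\|_{\mathbf{H}}^2 + h_1(\mathbf{y})$$

where $\|\mathbf{x}\|_{\mathbf{H}}^2 = \mathbf{x}^\top \mathbf{H} \mathbf{x}$ is the quadratic norm with metric $\mathbf{H}$.
Interestingly, we note that the non-convexity of the initial
problem is taken into account only through the proximal
Newton operator and its impact on the algorithm, compared to
the convex case, is minor since it only modifies the argument
of the operator through $\mathbf{v}_k$.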
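
The link between Problem (6) and the proximal expression (7) can be sketched by comparing first-order optimality conditions. This is only an outline under the stated assumptions ($h_1$ convex, $\mathbf{H}_k \succ 0$); the auxiliary symbols $\mathbf{u}$ and $\mathbf{y}$ are introduced here for illustration, and the full argument is the one given in the appendix.

```latex
% Optimality of \Delta\mathbf{x}_k for Problem (6):
\mathbf{0} \in \mathbf{H}_k \Delta\mathbf{x}_k + \mathbf{v}_k + \partial h_1(\mathbf{x}_k + \Delta\mathbf{x}_k)

% Optimality of \mathbf{y} = \operatorname{prox}_{h_1}^{\mathbf{H}_k}(\mathbf{u}):
\mathbf{H}_k(\mathbf{u} - \mathbf{y}) \in \partial h_1(\mathbf{y})

% Choosing \mathbf{u} = \mathbf{x}_k - \mathbf{H}_k^{-1}\mathbf{v}_k and \mathbf{y} = \mathbf{x}_k + \Delta\mathbf{x}_k gives
\mathbf{H}_k(\mathbf{u} - \mathbf{y}) = -\mathbf{v}_k - \mathbf{H}_k \Delta\mathbf{x}_k \in \partial h_1(\mathbf{x}_k + \Delta\mathbf{x}_k)
```

The last inclusion is exactly the first one rearranged, so the two problems share the same unique solution.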

Once the search direction is computed, the step size $t_k$ is
backtracked starting from $t_k = 1$. Algorithm 1 summarizes the
main steps of the optimization scheme. Some implementation
issues are discussed hereafter while the next section focuses
on the convergence analysis.
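
The steps of Algorithm 1 can be sketched on a small separable problem. This is a minimal illustration and not the authors' Matlab implementation: it assumes $f_2 = 0$, a quadratic $f_1$ whose exact Hessian $\mathbf{H}_k = \operatorname{diag}(a)$ is diagonal (so that the scaled proximal operator of $h_1 = \lambda\|\cdot\|_1$ reduces to coordinate-wise soft-thresholding), and a capped-$\ell_1$ penalty with $h_2$ applied coordinate-wise. The function names, the backtracking factor `beta` and the test of the form $F(\mathbf{x} + t\Delta\mathbf{x}) - F(\mathbf{x}) \le \alpha t D$ with $D = \mathbf{v}^\top\Delta\mathbf{x} + h_1(\mathbf{x} + \Delta\mathbf{x}) - h_1(\mathbf{x})$ are choices made for this sketch.

```python
def soft_threshold(u, tau):
    """Proximal operator of tau * |.| evaluated at the scalar u."""
    if u > tau:
        return u - tau
    if u < -tau:
        return u + tau
    return 0.0


def dc_proximal_newton(a, b, lam, theta, alpha=0.25, beta=0.5, max_iter=100):
    """Minimize 0.5 * sum_i a_i (x_i - b_i)^2 + lam * sum_i min(|x_i|, theta).

    Illustrative assumptions: f2 = 0, H_k = diag(a) (exact Hessian of f1),
    h1 = lam * ||x||_1 and h2 = lam * sum_i max(|x_i| - theta, 0).
    """
    def objective(x):
        loss = 0.5 * sum(ai * (xi - bi) ** 2 for ai, xi, bi in zip(a, x, b))
        penalty = lam * sum(min(abs(xi), theta) for xi in x)
        return loss + penalty

    x = [0.0] * len(b)
    for _ in range(max_iter):
        grad_f1 = [ai * (xi - bi) for ai, xi, bi in zip(a, x, b)]
        # Step 4: one subgradient of h2 at x (z_f2 = 0 in this sketch).
        z_h2 = [lam * (1.0 if xi > 0 else -1.0) if abs(xi) > theta else 0.0
                for xi in x]
        # Step 6: v = grad f1(x) - z_f2 - z_h2.
        v = [g - z for g, z in zip(grad_f1, z_h2)]
        # Step 7: with a diagonal metric, the scaled prox of lam * ||.||_1 is a
        # coordinate-wise soft-thresholding with threshold lam / a_i.
        dx = [soft_threshold(xi - vi / ai, lam / ai) - xi
              for xi, vi, ai in zip(x, v, a)]
        if max(abs(d) for d in dx) < 1e-12:
            break  # a zero direction signals a stationary point
        # D = v^T dx + h1(x + dx) - h1(x), used in the sufficient descent test.
        D = sum(vi * di for vi, di in zip(v, dx)) + lam * (
            sum(abs(xi + di) for xi, di in zip(x, dx))
            - sum(abs(xi) for xi in x))
        # Step 8: backtracking starting from t = 1.
        t, f_old = 1.0, objective(x)
        while (objective([xi + t * di for xi, di in zip(x, dx)]) - f_old
               > alpha * t * D) and t > 1e-12:
            t *= beta
        # Step 9: update the iterate.
        x = [xi + t * di for xi, di in zip(x, dx)]
    return x


solution = dc_proximal_newton(a=[1.0, 2.0, 4.0], b=[3.0, 0.1, -2.0],
                              lam=0.5, theta=1.0)
print(solution)
```

With a non-diagonal $\mathbf{H}_k$ the proximal step no longer has a closed form and has to be solved iteratively, which is the situation addressed in the implementation discussion.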

B. Implementation’s tricks of the trade 

The main difficulty and computational burden of our DC
proximal Newton algorithm resides in the computation of the
search direction $\Delta\mathbf{x}_k$. Indeed, the latter needs the computation
of the proximal operator $\operatorname{prox}_{h_1}^{\mathbf{H}_k}(\mathbf{x}_k - \mathbf{H}_k^{-1}\mathbf{v}_k)$, which is equal
to

$$\arg\min_{\mathbf{y}} \; \underbrace{\tfrac{1}{2}\mathbf{y}^\top \mathbf{H}_k \mathbf{y} + \mathbf{y}^\top(\mathbf{v}_k - \mathbf{H}_k \mathbf{x}_k)}_{g(\mathbf{y})} + h_1(\mathbf{y}) \tag{8}$$

We can note that Equation (8) represents a quadratic problem
penalized by $h_1$. If $h_1(\mathbf{y})$ is a term whose proximal operator
can be cheaply computed, then one can consider a proximal
gradient algorithm or any other efficient algorithm for its
resolution [6], [32].

Table I
Summary of related approaches according to how $f(\mathbf{x})$ and
$h(\mathbf{x})$ are decomposed in $f_1 - f_2$ and $h_1 - h_2$. cvx and ncvx
respectively stand for convex and non-convex. "-" denotes that
the method does not handle DC functions. The metric
column denotes the form of the metric used in the quadratic
approximation.

                           f(x)            h(x)           metric
Approach                   f1      f2      h1      h2     H
proximal gradient [15]     cvx             cvx     -      (L/2) I
proximal Newton [16]       cvx             cvx     -      H_k
GIST [34]                  ncvx            cvx     cvx    (L/2) I
SQP [35]                   cvx     cvx     cvx     cvx
our approach               cvx     cvx     cvx     cvx    H_k

In our case, we have considered a forward-backward (FB)
algorithm [15] initialized with the previous value of the
optimal $\mathbf{y}$. Note that in order to have a convergence guarantee, the
FB algorithm needs a stepsize smaller than $1/L$, where $L$ is the
Lipschitz constant of the gradient of the quadratic function $g$.
Computing $L$ can be expensive and, in order to increase the
computational efficiency of the global algorithm, we have chosen
a strategy that roughly estimates $L$ according to the equation

$$L \approx \frac{\|\nabla g(\mathbf{y}) - \nabla g(\mathbf{y}')\|_2}{\|\mathbf{y} - \mathbf{y}'\|_2}$$

In practice, we have found this heuristic to be slightly more
efficient than an approach which computes the largest eigenvalue
of $\mathbf{H}_k$ by means of a power method [33]. Note that an
L-BFGS approximation scheme has been used in the numerical
experiments for updating the matrix $\mathbf{H}_k$ and its inverse.
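
The forward-backward inner solver and the rough Lipschitz estimate can be sketched as follows. This is an illustrative sketch only: it assumes $h_1 = \lambda\|\cdot\|_1$, stores $\mathbf{H}_k$ as a dense list of rows, writes `c` for the vector $\mathbf{v}_k - \mathbf{H}_k\mathbf{x}_k$, and uses the step $1/\hat L$; all function names are hypothetical. For a quadratic $g$ the ratio never exceeds the largest eigenvalue of $\mathbf{H}_k$, so the estimate can be optimistic depending on the two points used; in the example below the points are chosen along the dominant direction so that it is exact.

```python
import math


def grad_g(H, c, y):
    """Gradient of g(y) = 0.5 * y^T H y + c^T y, with c standing for v_k - H_k x_k."""
    return [sum(hij * yj for hij, yj in zip(row, y)) + ci
            for row, ci in zip(H, c)]


def estimate_lipschitz(H, c, y, y_prime):
    """Rough estimate of L from the gradients at two distinct points."""
    diff_grad = [g1 - g2 for g1, g2 in
                 zip(grad_g(H, c, y), grad_g(H, c, y_prime))]
    diff_y = [p - q for p, q in zip(y, y_prime)]
    return (math.sqrt(sum(d * d for d in diff_grad))
            / math.sqrt(sum(d * d for d in diff_y)))


def forward_backward(H, c, lam, y0, lipschitz, n_iter=200):
    """Forward-backward iterations for min_y g(y) + lam * ||y||_1."""
    step = 1.0 / lipschitz
    y = list(y0)
    for _ in range(n_iter):
        # Forward (gradient) step on the smooth quadratic part.
        u = [yi - step * gi for yi, gi in zip(y, grad_g(H, c, y))]
        # Backward step: soft-thresholding, the prox of the l1 norm.
        y = [math.copysign(max(abs(ui) - step * lam, 0.0), ui) for ui in u]
    return y


H = [[4.0, 0.0], [0.0, 1.0]]
c = [-8.0, -0.2]
L_hat = estimate_lipschitz(H, c, [1.0, 0.0], [0.0, 0.0])
y_star = forward_backward(H, c, lam=0.5, y0=[0.0, 0.0], lipschitz=L_hat)
print(L_hat, y_star)
```

Warm-starting `y0` with the solution of the previous outer iteration, as described above, only changes the initial point passed to `forward_backward`.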

While the convergence analysis we provide in the next section
supposes that the proximal operator is computed exactly,
in practice it is more efficient to approximately solve the search
direction problem, at least for the early iterations. Following
this idea, we have considered an adaptive stopping criterion
for the proximal operator subproblem.

C. Related works 

In the last few years, a large number of works have been
devoted to the resolution of composite optimization problems
of the form given in Equation (1). We review the ones that are
most similar to ours and summarize the most important ones
in Table I.

Proximal Newton algorithms have recently been proposed
by [16] and [17] for solving Equation (1) when both functions
$f(\mathbf{x})$ and $h(\mathbf{x})$ are convex. While the algorithm we propose
is similar to the one of [16], our work is strictly more general
in the sense that we abandon the convexity hypothesis on
both functions. Indeed, our algorithm can handle both convex
and non-convex cases and boils down to the algorithm of
[16] in the convex case. Hence, the main contribution that
differentiates our work from the work of Lee et al. [16] lies in
the extension of the algorithm to the non-convex case and the
theoretical analysis of the resulting algorithm.

Following the interest in sparsity-inducing regularizers,
there has been a renewal of curiosity around non-convex
optimization problems [12], [13]. Indeed, most statistically
relevant sparsity-inducing regularizers are non-convex [36].
Hence, several researchers have proposed novel algorithms for
handling these issues.

We point out that linearizing the concave part in a DC
program is a crucial idea of DC programming and DCA, which
were introduced by Pham Dinh Tao in the early eighties and
have been extensively developed since then [18], [19], [37].
In this work, we have used this same idea in a proximal
Newton framework. However, our algorithm is fairly different
from the DCA [19] as we consider a single descent step
at each iteration, as opposed to the DCA which needs a
full optimization of a minimization problem at each iteration.
The DCA has as a special case the convex concave
procedure (CCCP) introduced by Yuille et al. [26] and used
for instance by Collobert et al. [38] in a machine learning
context.

This idea of linearizing the (possibly) non-convex part of
Problem (1) for obtaining a search direction can also be
found in Mine and Fukushima [39]. However, in their case,
the function to be linearized is supposed to be smooth. The
advantage of using a DC program, as in our case, is that
the linearization trick can also be extended to non-smooth
functions.

The works most closely related to ours are those proposed
by [34] and [35]. Gong et al. [34] introduced
a generalized iterative shrinkage algorithm (GIST) that can
handle optimization problems with DC regularizers for which
proximal operators can be easily computed. Lu [35] instead
solves the same optimization problem in a different way.
As the non-convex regularizers are supposed to be DC, he
proposes to solve a sequence of convex programs which at
each iteration minimizes

$$\tilde f(\mathbf{x}) + h_1(\mathbf{x}) - h_2(\mathbf{x}_k) - \mathbf{z}_{h_2}^\top(\mathbf{x} - \mathbf{x}_k)$$

with

$$\tilde f(\mathbf{x}) = f_1(\mathbf{x}_k) + \nabla f_1(\mathbf{x}_k)^\top(\mathbf{x} - \mathbf{x}_k) + \frac{L}{2}\|\mathbf{x} - \mathbf{x}_k\|^2$$

Note that our framework subsumes the one of Lu [35] (when
considering unconstrained optimization problems). Indeed, we
take into account a variable metric $\mathbf{H}_k$ in the proximal
term. Thus, the approach of Lu can be deemed a particular
case of our method where $\mathbf{H}_k = L\mathbf{I}$ at all iterations of the
algorithm. Hence, when $f(\mathbf{x})$ is convex, we expect more
efficiency compared to the algorithms of [34] and [35] owing
to the variable metric $\mathbf{H}_k$ that has been introduced.

Very recently, Chouzenoux et al. [40] introduced a proximal
Newton-like algorithm for minimizing the sum of a twice
differentiable function and a convex function. They essentially
consider that the regularization term is convex while the loss
function may be non-convex. Their work can thus be seen as
an extension of the one of [41] to the variable metric case.
Compared to our work, [40] does not impose a DC condition
on the function $f(\mathbf{x})$. However, at each iteration, they need a
quadratic surrogate function at a point $\mathbf{x}_k$ that majorizes $f(\mathbf{x})$.
In our case, only the non-convex part is majorized through a
simple linearization.


III. Analysis 

Our objective in this section is to show that our algorithm
is well-behaved and to establish to what extent the iterates
$\{\mathbf{x}_k\}$ converge to a stationary point of Problem (1). We first
characterize stationary points of Problem (1) with respect to
$\Delta\mathbf{x}$ and then show that all limit points of the sequence $\{\mathbf{x}_k\}$
generated by our algorithm are stationary points.

Throughout this work, we use the following definition of a 
stationary point. 

Definition 1: A point $\mathbf{x}^*$ is said to be a stationary point of
Problem (1) if

$$\mathbf{0} \in \nabla f_1(\mathbf{x}^*) - \nabla f_2(\mathbf{x}^*) + \partial h_1(\mathbf{x}^*) - \partial h_2(\mathbf{x}^*)$$

Note that being a stationary point, as defined above, is a
necessary condition for a point $\mathbf{x}^*$ to be a local minimizer
of Problem (1).

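According to the above definition, we have the following
lemma:

Lemma 1: Suppose $\mathbf{H}^* \succ 0$. Then $\mathbf{x}^*$ is a stationary point of
Problem (1) if and only if $\Delta\mathbf{x}^* = \mathbf{0}$ with

$$\Delta\mathbf{x}^* = \arg\min_{\mathbf{d}} \; (\mathbf{v}^*)^\top \mathbf{d} + \tfrac{1}{2}\mathbf{d}^\top \mathbf{H}^* \mathbf{d} + h_1(\mathbf{x}^* + \mathbf{d}) \tag{9}$$

and $\mathbf{v}^* = \nabla f_1(\mathbf{x}^*) - \mathbf{z}_{f_2}^* - \mathbf{z}_{h_2}^*$, $\mathbf{z}_{f_2}^* = \nabla f_2(\mathbf{x}^*)$ and
$\mathbf{z}_{h_2}^* \in \partial h_2(\mathbf{x}^*)$.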

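Proof: Let us start by characterizing the solution $\Delta\mathbf{x}^*$.
By definition, we have $\Delta\mathbf{x}^* + \mathbf{x}^* = \operatorname{prox}_{h_1}^{\mathbf{H}^*}(\mathbf{x}^* - \mathbf{H}^{*-1}\mathbf{v}^*)$
and thus, according to the optimality condition of the proximal
operator, the following inclusion holds

$$\mathbf{H}^*(\mathbf{x}^* - \mathbf{H}^{*-1}\mathbf{v}^* - \Delta\mathbf{x}^* - \mathbf{x}^*) \in \partial h_1(\Delta\mathbf{x}^* + \mathbf{x}^*)$$

which after rearrangement is equivalent to

$$\mathbf{z}_{h_2}^* - \mathbf{H}^*\Delta\mathbf{x}^* \in \nabla f(\mathbf{x}^*) + \partial h_1(\Delta\mathbf{x}^* + \mathbf{x}^*) \tag{10}$$

with $\nabla f(\mathbf{x}^*) = \nabla f_1(\mathbf{x}^*) - \nabla f_2(\mathbf{x}^*)$. This also means that
there exists a $\mathbf{z}_{h_1}^* \in \partial h_1(\Delta\mathbf{x}^* + \mathbf{x}^*)$ so that

$$\mathbf{z}_{h_2}^* - \mathbf{H}^*\Delta\mathbf{x}^* - \nabla f(\mathbf{x}^*) - \mathbf{z}_{h_1}^* = \mathbf{0} \tag{11}$$

Remember that by hypothesis, since $\mathbf{x}^*$ is a stationary point
of Problem (1), we have

$$\mathbf{0} \in \nabla f(\mathbf{x}^*) + \partial h_1(\mathbf{x}^*) - \partial h_2(\mathbf{x}^*)$$

We now prove that if $\mathbf{x}^*$ is a stationary point of Problem
(1) then $\Delta\mathbf{x}^* = \mathbf{0}$ by showing the contrapositive. Suppose
that $\Delta\mathbf{x}^* \neq \mathbf{0}$. $\Delta\mathbf{x}^*$ is a vector that satisfies the optimality
condition (10) and it is the unique one according to properties
of the proximal operator. This means that the vector $\mathbf{0}$ is not
optimal for Problem (9) and thus there does not exist a vector
$\mathbf{z}_{h_1,0}^* \in \partial h_1(\mathbf{d} + \mathbf{x}^*)$ so that

$$\mathbf{z}_{h_2}^* - \mathbf{H}^*\mathbf{d} - \nabla f(\mathbf{x}^*) - \mathbf{z}_{h_1,0}^* = \mathbf{0} \tag{12}$$

with $\mathbf{d} = \mathbf{0}$. Note that this statement is valid for any $\mathbf{z}_{h_2}^*$ chosen
in the set $\partial h_2(\mathbf{x}^*)$, and it also means that there is no
$\mathbf{z}_{h_1,0}^* \in \partial h_1(\mathbf{x}^*)$ so that $\nabla f(\mathbf{x}^*) + \mathbf{z}_{h_1,0}^* - \mathbf{z}_{h_2}^* = \mathbf{0}$, which
proves that $\mathbf{x}^*$ is not a stationary point of Problem (1).
Suppose now that $\Delta\mathbf{x}^* = \mathbf{0}$. Then, according to the definition
of $\Delta\mathbf{x}^*$ and the resulting condition (10), it is straightforward
to note that $\mathbf{x}^*$ satisfies the definition of a stationary point. □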

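Now, we proceed by showing that at each iteration, the
search direction $\Delta\mathbf{x}_k$ satisfies a property which implies that
for a sufficiently small step size $t_k$, the search direction is a
descent direction.

Lemma 2: For $\mathbf{x}_k$ in the domain of $f$ and supposing that
$\mathbf{H}_k \succ 0$, $\Delta\mathbf{x}_k$ is such that

$$F(\mathbf{x}_{k+1}) \le F(\mathbf{x}_k) + t_k\left(\mathbf{v}_k^\top \Delta\mathbf{x}_k + h_1(\Delta\mathbf{x}_k + \mathbf{x}_k) - h_1(\mathbf{x}_k)\right) + O(t_k^2)$$

and

$$F(\mathbf{x}_{k+1}) - F(\mathbf{x}_k) \le -t_k \Delta\mathbf{x}_k^\top \mathbf{H}_k \Delta\mathbf{x}_k + O(t_k^2) \tag{13}$$

with $\mathbf{v}_k = \nabla f_1(\mathbf{x}_k) - \mathbf{z}_{f_2} - \mathbf{z}_{h_2}$.

Proof: For the sake of clarity, we drop the index $k$ and
use the following notation: $\mathbf{x} := \mathbf{x}_k$, $\Delta\mathbf{x} := \Delta\mathbf{x}_k$, $\mathbf{x}_+ :=
\mathbf{x}_k + t_k\Delta\mathbf{x}_k$ and $t := t_k$. By definition, we have

$$\begin{aligned}
F(\mathbf{x}_+) - F(\mathbf{x}) = {} & f_1(\mathbf{x}_+) - f_1(\mathbf{x}) - f_2(\mathbf{x}_+) + f_2(\mathbf{x}) \\
& + h_1(\mathbf{x}_+) - h_1(\mathbf{x}) - h_2(\mathbf{x}_+) + h_2(\mathbf{x}).
\end{aligned}$$

Then by convexity of $f_2$, $h_2$, $h_1$ and for $t \in [0,1]$, we
respectively have

$$-\mathbf{z}_{f_2}^\top(\mathbf{x}_+ - \mathbf{x}) \ge f_2(\mathbf{x}) - f_2(\mathbf{x}_+),$$

$$-\mathbf{z}_{h_2}^\top(\mathbf{x}_+ - \mathbf{x}) \ge h_2(\mathbf{x}) - h_2(\mathbf{x}_+)$$

and

$$h_1(\mathbf{x} + t\Delta\mathbf{x}) \le t\,h_1(\mathbf{x} + \Delta\mathbf{x}) + (1 - t)\,h_1(\mathbf{x})$$

Plugging these inequalities in the definition of $F(\mathbf{x}_+) - F(\mathbf{x})$
gives:

$$\begin{aligned}
F(\mathbf{x}_+) - F(\mathbf{x}) \le {} & f_1(\mathbf{x}_+) - f_1(\mathbf{x}) + (1 - t)\,h_1(\mathbf{x}) + t\,h_1(\mathbf{x} + \Delta\mathbf{x}) \\
& - t(\mathbf{z}_{f_2} + \mathbf{z}_{h_2})^\top \Delta\mathbf{x} - h_1(\mathbf{x}) \\
\le {} & t\,\nabla f_1(\mathbf{x})^\top \Delta\mathbf{x} + t\,h_1(\mathbf{x} + \Delta\mathbf{x}) \\
& - t\,h_1(\mathbf{x}) - t(\mathbf{z}_{f_2} + \mathbf{z}_{h_2})^\top \Delta\mathbf{x} + O(t^2)
\end{aligned} \tag{14}$$

which proves the first inequality of the lemma.

For showing the descent property, we demonstrate that the
following inequality holds

$$\underbrace{\mathbf{v}^\top \Delta\mathbf{x} + h_1(\mathbf{x} + \Delta\mathbf{x}) - h_1(\mathbf{x})}_{D} \le -\Delta\mathbf{x}^\top \mathbf{H} \Delta\mathbf{x} \tag{15}$$

Since $\Delta\mathbf{x}$ is the minimizer of Problem (6), the following
inequalities hold for $t\Delta\mathbf{x}$ and $t \in [0,1]$:

$$\begin{aligned}
\tfrac{1}{2}\Delta\mathbf{x}^\top \mathbf{H} \Delta\mathbf{x} + h_1(\mathbf{x} + \Delta\mathbf{x}) + \mathbf{v}^\top \Delta\mathbf{x}
\le {} & \tfrac{t^2}{2}\Delta\mathbf{x}^\top \mathbf{H} \Delta\mathbf{x} + h_1(\mathbf{x} + t\Delta\mathbf{x}) + t\,\mathbf{v}^\top \Delta\mathbf{x} \\
\le {} & \tfrac{t^2}{2}\Delta\mathbf{x}^\top \mathbf{H} \Delta\mathbf{x} + (1 - t)\,h_1(\mathbf{x}) + t\,h_1(\mathbf{x} + \Delta\mathbf{x}) \\
& + t\,\mathbf{v}^\top \Delta\mathbf{x}
\end{aligned} \tag{16}$$

After rearrangement we have the inequality

$$\mathbf{v}^\top \Delta\mathbf{x} + h_1(\mathbf{x} + \Delta\mathbf{x}) - h_1(\mathbf{x}) \le -\tfrac{1}{2}(1 + t)\,\Delta\mathbf{x}^\top \mathbf{H} \Delta\mathbf{x}$$

which is valid for all $t \in [0,1]$ and in particular for $t = 1$,
which concludes the proof of inequality (15). By plugging this
result into inequality (14), the descent property holds. □

Note that the descent property is supposed to hold for
sufficiently small step size. In our algorithm, this stepsize $t_k$
is selected by backtracking so that the following sufficient
descent condition holds

$$F(\mathbf{x}_{k+1}) - F(\mathbf{x}_k) \le \alpha\, t_k D_k \tag{17}$$

with $\alpha \in (0, 1/2)$. The next lemma shows that if the function
$f_1$ is sufficiently smooth, then there always exists a step size
so that the above sufficient descent condition holds.

Lemma 3: For $\mathbf{x}$ in the domain of $f$ and assuming that
$\mathbf{H}_k \succeq m\mathbf{I}$ with $m > 0$ and $\nabla f_1$ is Lipschitz with constant
$L$, then the sufficient descent condition in Equation (17) holds
for all $t_k$ so that

$$t_k \le \min\left(\;\ldots\;\right)$$

Proof: This technical proof has been postponed to the
appendix. □

According to the above lemma, we can suppose that if some
mild conditions on $f_1$ are satisfied (smoothness and bounded
curvature) then we can expect our DC algorithm to behave
properly. This intuition is formalized in the following property.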

Proposition 1: Suppose $f_1$ has a gradient which is Lipschitz
continuous with constant $L$ and that $\mathbf{H}_k \succeq m\mathbf{I}$ for all $k$ and
$m > 0$, then all the limit points of the sequence $\{\mathbf{x}_k\}$ are
stationary points.

Proof: Let $\mathbf{x}^*$ be a limit point of the sequence $\{\mathbf{x}_k\}$. Then
there exists a subsequence $\mathcal{K}$ so that

$$\lim_{k \in \mathcal{K}} \mathbf{x}_k = \mathbf{x}^*$$

At each iteration the step size $t_k$ has been chosen so as to
satisfy the sufficient descent condition given in Equation (17).
According to Lemma 3, we know that such
a step size always exists and that it is always non-zero. Hence
the sequence $\{F(\mathbf{x}_k)\}$ is a strictly decreasing sequence. As $F$
is lower bounded, the sequence $\{F(\mathbf{x}_k)\}$ converges to some
limit. Thus, we have

$$\lim_{k \to \infty} F(\mathbf{x}_k) = \lim_{k \in \mathcal{K}} F(\mathbf{x}_k) = F(\mathbf{x}^*)$$

as $F(\cdot)$ is continuous. Thus, we also have

$$\lim_{k \in \mathcal{K}} F(\mathbf{x}_{k+1}) - F(\mathbf{x}_k) = 0$$

Now because each term $F(\mathbf{x}_{k+1}) - F(\mathbf{x}_k)$ is negative, we can
also deduce from Equations (15) and (17) and the limit of
$F(\mathbf{x}_{k+1}) - F(\mathbf{x}_k)$ that

$$\lim_{k \in \mathcal{K}} \mathbf{v}_k^\top \Delta\mathbf{x}_k + h_1(\mathbf{x}_k + \Delta\mathbf{x}_k) - h_1(\mathbf{x}_k) = \lim_{k \in \mathcal{K}} -\Delta\mathbf{x}_k^\top \mathbf{H}_k \Delta\mathbf{x}_k = 0$$

Since $\mathbf{H}_k$ is positive definite, this also means that

$$\lim_{k \in \mathcal{K}} \Delta\mathbf{x}_k = \mathbf{0}$$

Considering now that $\Delta\mathbf{x}_k$ is a minimizer of Problem (6), we
have

$$\mathbf{0} \in \mathbf{H}_k \Delta\mathbf{x}_k + \partial h_1(\mathbf{x}_k + \Delta\mathbf{x}_k) + \nabla f_1(\mathbf{x}_k) - \nabla f_2(\mathbf{x}_k) - \partial h_2(\mathbf{x}_k)$$

Now, by taking limits on both sides of the above inclusion for
$k \in \mathcal{K}$, we have

$$\mathbf{0} \in \partial h_1(\mathbf{x}^*) + \nabla f_1(\mathbf{x}^*) - \partial h_2(\mathbf{x}^*) - \nabla f_2(\mathbf{x}^*)$$

Thus, $\mathbf{x}^*$ is a stationary point of Problem (1). □

The above proposition shows that under simple conditions
on $f_1$, any limit point of the sequence $\{\mathbf{x}_k\}$ is a stationary
point of $F$. Hence the proposition is quite general and applies
to a large class of functions. If we impose stronger constraints
on the functions $f_1$, $f_2$, $h_1$ and $h_2$, it is possible to leverage
the Kurdyka-Łojasiewicz (KL) theory [42],
recently developed for the convergence analysis of iterative
algorithms for non-convex optimization, to show that the
sequence $\{\mathbf{x}_k\}$ is indeed convergent. Based on the recent
works of Attouch et al. [42], [43], Bolte et al. [44]
and Chouzenoux et al. [40], we have carried out a convergence
analysis of our algorithm for functions $F$ that satisfy the KL
property. However, due to the strong restrictions imposed by
the convergence conditions (for instance on the loss function
and on the regularizer) and for the sake of clarity, we have
postponed such an analysis to the appendix.

IV. Experiments 

In order to provide evidence of the benefits of the proposed
approach for solving DC non-convex problems, we have
carried out two numerical experiments. First we analyze our
algorithm when the function $f$ is convex and the regularizer
$h$ is a non-convex and non-differentiable sparsity-inducing
penalty. Second, we study the case when both $f$ and $h$ are
non-convex. All experiments have been run on a notebook
Linux machine powered by an Intel Core i7 with 16 gigabytes
of memory. All the codes have been written in Matlab.

Note that for all numerical results, we have used a
limited-memory BFGS (L-BFGS) approach for approximating the
Hessian matrix through rank-1 update. This approach is
well known for its ability to handle large-scale problems. By
default, the limited-memory size for the L-BFGS has been set
to 5.
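
How a limited-memory scheme can apply an inverse Hessian approximation without ever forming a matrix may be sketched with the standard L-BFGS two-loop recursion. This is a generic textbook sketch and an assumption about the kind of routine involved, not the authors' Matlab code; the function names, the scaled-identity initialization and the `memory` argument (set to 5 to mirror the default above) are illustrative choices. It only covers the product with the inverse approximation; the metric $\mathbf{H}_k$ itself is also needed by the proximal step.

```python
def dot(u, w):
    return sum(ui * wi for ui, wi in zip(u, w))


def lbfgs_inverse_apply(q, s_hist, y_hist):
    """Apply the L-BFGS inverse-Hessian approximation to the vector q.

    s_hist[i] = x_{i+1} - x_i and y_hist[i] = grad_{i+1} - grad_i, oldest
    first; every stored pair is assumed to satisfy dot(s, y) > 0.
    """
    q = list(q)
    rhos = [1.0 / dot(y, s) for s, y in zip(s_hist, y_hist)]
    alphas = []
    # First loop: from the newest pair to the oldest one.
    for s, y, rho in reversed(list(zip(s_hist, y_hist, rhos))):
        a = rho * dot(s, q)
        alphas.append(a)
        q = [qi - a * yi for qi, yi in zip(q, y)]
    # Initial approximation: scaled identity (a common choice, assumed here).
    if s_hist:
        gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]
    # Second loop: from the oldest pair to the newest one.
    for s, y, rho, a in zip(s_hist, y_hist, rhos, reversed(alphas)):
        b = rho * dot(y, r)
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return r


def push_pair(s_hist, y_hist, s, y, memory=5):
    """Store a new (s, y) pair and drop the oldest one beyond the memory size."""
    s_hist.append(s)
    y_hist.append(y)
    if len(s_hist) > memory:
        s_hist.pop(0)
        y_hist.pop(0)


s_hist, y_hist = [], []
push_pair(s_hist, y_hist, [1.0, 0.0], [2.0, 1.0])
# Secant property: applying the approximation to the newest y returns the newest s.
print(lbfgs_inverse_apply([2.0, 1.0], s_hist, y_hist))
```

Only the last `memory` pairs are kept, so the cost of one product grows linearly with the dimension, which is what makes the approach attractive for large-scale problems.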

A. Sparse Logistic Regression 

We consider here $f(\mathbf{x})$ as the following convex loss function

$$f(\mathbf{x}) = \sum_{i=1}^{\ell} \log\left(1 + \exp(-y_i \mathbf{a}_i^\top \mathbf{x})\right)$$

where $\{\mathbf{a}_i, y_i\}_{i=1}^{\ell}$ are the training examples and their
associated labels available for learning the model. The regularizer
we have considered is the capped-$\ell_1$ defined as $h(\mathbf{x}) =
h_1(\mathbf{x}) - h_2(\mathbf{x})$ with

$$h_1(\mathbf{x}) = \lambda\|\mathbf{x}\|_1 \quad \text{and} \quad h_2(\mathbf{x}) = \lambda\left(\|\mathbf{x}\|_1 - \theta\right)_+ \tag{18}$$

and the operator $(u)_+ = u$ if $u > 0$ and $0$ otherwise.
Note that here we focus on binary classification problems but
extension to multiclass problems can be easily handled by
using a multinomial logistic loss instead of a logistic one.
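
The two ingredients of this experiment can be written down in a few lines. The sketch below is illustrative only: it assumes labels in $\{-1, +1\}$ and assumes that the positive part in $h_2$ is applied coordinate-wise, as is usual for the capped-$\ell_1$ penalty $\lambda\sum_j \min(|x_j|, \theta)$; the function names are hypothetical.

```python
import math


def logistic_loss(x, examples, labels):
    """f(x) = sum_i log(1 + exp(-y_i * a_i^T x)) with labels y_i in {-1, +1}."""
    total = 0.0
    for a_i, y_i in zip(examples, labels):
        margin = y_i * sum(aj * xj for aj, xj in zip(a_i, x))
        # Numerically stable evaluation of log(1 + exp(-margin)).
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total


def h1(x, lam):
    """Convex part: lam * ||x||_1."""
    return lam * sum(abs(xj) for xj in x)


def h2(x, lam, theta):
    """Convex part subtracted from h1; positive part taken per coordinate
    (an assumption of this sketch)."""
    return lam * sum(max(abs(xj) - theta, 0.0) for xj in x)


def capped_l1(x, lam, theta):
    """Capped-l1 penalty written directly as lam * sum_j min(|x_j|, theta)."""
    return lam * sum(min(abs(xj), theta) for xj in x)


x = [0.5, -3.0, 0.0]
# The DC decomposition and the direct definition should agree.
print(h1(x, 2.0) - h2(x, 2.0, 1.0), capped_l1(x, 2.0, 1.0))
```

Under this coordinate-wise reading, $h_1 - h_2$ coincides with the capped penalty for every $\mathbf{x}$, since $|u| - (|u| - \theta)_+ = \min(|u|, \theta)$.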

Since several other algorithms are able to solve the
optimization problem related to this sparse logistic regression
problem as given by Equation (1), our objective here is to
show that the proposed DC proximal Newton is
computationally more efficient than competitors, while achieving
equivalent classification performance. For this experiment,
we have considered a DCA algorithm [18] as a baseline
and, as a single competitor, the recently proposed GIST
algorithm [34]. Indeed, this latter approach has already been
shown by its authors to be more efficient than several other
competitors including SCP (sequential convex programming)
[35] and Multistage Sparsa [45]. As shown in Table I, none of
these competitors handles second-order information for a
non-convex regularization term. But the computational advantage
brought by using this second-order information has still to
be shown since, in practice, the resulting numerical cost per
iteration is higher in our approach because of the
metric term $\mathbf{H}_k$. As second-order methods usually suffer more
on high-dimensional problems, the comparison has been
carried out when the dimensionality $d$ is very large. Finally,
a slight advantage has been given to GIST as we consider
its non-monotone version (more efficient than the monotone
counterpart) whereas our approach decreases the objective
value at each iteration. Although the DC algorithm as described
in Section II-C has already been shown to be less efficient
than GIST in [46], we have still reported its results in order
to confirm this tendency. Note that for the DC approach, we
allowed a maximum of 20 DC iterations.

1) Toy dataset: We have first evaluated the baseline DC
algorithm, GIST and our DC proximal Newton on a toy dataset
where only a few features are relevant for the discrimination
task. The toy problem is the same as the one used by [47].
The task is a binary classification problem in $\mathbb{R}^d$. Among these
$d$ variables, only $T$ of them define a subspace of $\mathbb{R}^d$ in which
classes can be discriminated. For these $T$ relevant variables,
the two classes follow a Gaussian pdf with means respectively
$\mu$ and $-\mu$ and covariance matrices randomly drawn from
a Wishart distribution. $\mu$ has been randomly drawn from
$\{-1, +1\}^T$. The other $d - T$ non-relevant variables follow
an i.i.d. Gaussian probability distribution with zero mean and
unit variance for both classes. We have respectively sampled
$N$ and $n_t = 5000$ examples for training and
testing. Before learning, the training set has been normalized
to zero mean and unit variance and the test set has been rescaled
accordingly. The hyperparameters $\lambda$ and $\theta$ of the regularization
term (18) have been roughly set so as to maximize the
performance of the GIST algorithm on the test set. We have chosen
to initialize all algorithms with the zero vector ($\mathbf{x}_0 = \mathbf{0}$) and
we terminate them if the relative change of two consecutive 
objective function values is less than 10“®. 
Table II
Comparison between DCA, GIST and our DC proximal Newton (DC-PN) on toy
problems with increasing number of relevant variables. Performances
reported in bold are statistically significantly different than their
competitor counterpart according to a Wilcoxon signed rank test with a
p-value at 0.001. A minus sign in the relative objective value indicates
that the DC proximal Newton approach provides larger objective value
than GIST. The hyperparameters $\lambda$ and $\theta$ have been chosen so as to
maximize performances of GIST.

d = 2000, N = 100000, λ = 2.00, θ = 0.20

        Class. Rate (%)                        Time (s)                                  Obj Val (%)
T       DCA         GIST        DC-PN          DCA            GIST          DC-PN         Rel. Diff
50      92.18±0.0   92.18±0.0   91.94±0.0      255.40±0.0     95.42±0.0     70.17±0.0     -6.646
100     91.84±1.9   91.84±1.9   91.78±1.9      117.07±21.4    60.02±9.9     44.42±12.0    -1.095
500     91.52±0.8   91.52±0.8   91.50±0.8      137.85±14.1    57.41±5.2     46.87±13.0    -0.339
1000    91.69±0.7   91.69±0.7   91.69±0.7      148.97±9.9     61.18±6.4     49.05±15.6    -0.198

d = 10000, N = 5000, λ = 2.00, θ = 2.00

        Class. Rate (%)                        Time (s)                                  Obj Val (%)
T       DCA         GIST        DC-PN          DCA            GIST          DC-PN         Rel. Diff
50      88.55±2.5   88.53±2.5   88.57±2.5      96.28±30.4     48.82±11.5    26.54±2.3     0.025
100     87.81±2.8   87.76±2.8   87.81±2.8      72.55±7.6      38.30±6.6     24.27±2.5     0.016
500     81.82±0.9   81.78±0.9   81.82±0.9      71.91±6.0      33.73±2.7     21.67±0.9     0.004
1000    76.23±0.9   76.20±0.9   76.23±0.9      74.41±7.9      32.79±3.2     21.59±0.9     0.007



Reported performances and running times averaged over 
30 trials are given in Table II for two different settings 
of the dimensionality $d$ and the number of training examples 
$N$. We note that for both problems our DC proximal 
Newton is computationally more efficient than GIST, with 
respect to the stopping criterion we set, while the recognition 
performances of both approaches are equivalent. As expected 
and as discussed above, the DC algorithm is substantially 
slower than GIST and our approach. Interestingly, 
the competing algorithms do not reach similar 
objective values. This means that despite sharing the same 
initialization at the null vector, the methods follow different 
trajectories during optimization and converge to different 
stationary points. Although we leave the full understanding of 
this phenomenon to future work, we conjecture that it is 
due to the primal-dual nature of the DC algorithm [37], which 
contrasts with the first-order primal descent of GIST. 

2) Benchmark datasets: The same experiments have been 
carried out on real-world high-dimensional learning problems. 
These datasets are those already used by [34] for illustrating 
the behaviour of their GIST algorithm. Here, the available 
examples are split into a training and a testing set with a ratio 
of 80%-20% and hyperparameters have been roughly set to 
maximize the performance of GIST. 

From Table III, we can note that while almost equivalent, 
recognition performances are sometimes statistically better for 
one method than the other, although there is no clear winner. 
From the running time point of view, our DC proximal Newton 
always exhibits a better behaviour than GIST. Indeed, its 
running time is always lower, regardless of the dataset, and 
the difference in efficiency is statistically significant 
for 4 out of 5 datasets. In addition, we can note that in 
some situations, the gain in running time reaches an order of 
magnitude, clearly showing the benefit of a proximal Newton 
approach. Note that the baseline DC approach is slower than 
our DC proximal Newton except for one dataset where it 
converges faster than all methods. For this dataset, the DC 
algorithm needed only very few DC iterations, explaining its 
fast convergence. 

B. Sparse Transductive Logistic Regression 

In this second experiment, we show an example of a situation 
where one has to deal with a non-convex loss function as 
well as a non-convex regularizer, namely sparse transductive 
logistic regression. The principle of transductive learning is to 
leverage unlabeled examples during the training step. This is 
usually done by using a loss function for unlabeled examples 
that enforces the decision function to lie in regions of low 
density. A way to achieve this is the use of a symmetric 
loss function which penalizes unlabeled examples lying in the 
margin of the classifier. It is well known that this approach, 
also known as low density separation, leads to a non-convex 
data fitting term on the unlabeled examples [48]. For instance, 
Joachims [49] has considered a symmetric hinge loss for the 
unlabeled examples in his transductive implementation of 
SVM. Collobert et al. [38] extended this idea of a symmetric 
hinge loss into a symmetric ramp loss, which has a plateau 
on its top. In order to have a smooth transductive loss, Chapelle 
et al. [48] used a symmetric sigmoid loss. 

For our purpose the transductive loss function is required to 
be differentiable. Hence we propose the following symmetric 
differentiable loss that can be written as a difference of convex 
functions 

$$T(u) = 1 - g_1(u) - g_2(u)$$

where $g_1(u) = \frac{1}{\tau}\left(g(u) - g(u + \tau)\right)$, $g_2(u) = g_1(-u)$ and 
$g(u) = \log(1 + \exp(-u))$. Note that $g(u)$ is a convex function 
as depicted in Figure 1 and combinations of shifted and 
reversed versions of $g(u)$ lead to $g_1$ and $g_2$. $\tau$ is a parameter 
that modifies the smoothness of $T(\cdot)$. From the expression 
of $g_1$ and $g_2$, it is easy to retrieve the difference of convex 
functions form $T(u) = T_1(u) - T_2(u)$ with 
$T_1(u) = 1 + \frac{1}{\tau}\left(g(u + \tau) + g(-u + \tau)\right)$ and $T_2(u) = \frac{1}{\tau}\left(g(u) + g(-u)\right)$. 
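As a sanity check on this decomposition, the following sketch evaluates the loss and its two convex components numerically. It assumes that the scaling factor in $g_1$, $T_1$ and $T_2$ is $1/\tau$ (the choice that makes the loss vanish far from the margin); the function names are illustrative only.

```python
import numpy as np

def g(u):
    """Logistic loss log(1 + exp(-u)), evaluated in a numerically stable way."""
    return np.logaddexp(0.0, -u)

def transductive_loss(u, tau=1.0):
    """T(u) = 1 - g1(u) - g2(u), with g1(u) = (g(u) - g(u + tau)) / tau (assumed scaling)."""
    g1 = (g(u) - g(u + tau)) / tau
    g2 = (g(-u) - g(-u + tau)) / tau   # g2(u) = g1(-u)
    return 1.0 - g1 - g2

def T1(u, tau=1.0):
    """First convex component of the DC decomposition."""
    return 1.0 + (g(u + tau) + g(-u + tau)) / tau

def T2(u, tau=1.0):
    """Second convex component of the DC decomposition."""
    return (g(u) + g(-u)) / tau

u = np.linspace(-8.0, 8.0, 401)
loss = transductive_loss(u, tau=1.0)
# The loss equals T1 - T2 and is symmetric around u = 0.
print(np.allclose(loss, T1(u) - T2(u)), np.allclose(loss, loss[::-1]))
```

Both components are sums of shifted or reflected copies of the convex function $g$, which is why each of them is convex while their difference is not.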



Table III 

Comparison between DCA, GIST and our DC proximal Newton (DC-PN) on real-world benchmark problems. The first columns of the 
table provide the names of the datasets and their statistics. Performances reported in bold are statistically significantly different 
from their competitor counterpart according to a Wilcoxon signed rank test with a p-value at 0.001. A minus sign in the 
relative objective value indicates that the DC proximal Newton approach provides a larger objective value than GIST. 

                                  Class. Rate (%)                      Time (s)                           Obj Val (%) 
dataset    N       d        DCA         GIST        DC-PN       DCA             GIST            DC-PN         Rel. Diff 
la2        2460    31472    91.32±0.9   91.67±0.9   91.81±0.9   36.61±11.5      45.86±26.4      21.74±11.9    -165.544 
sports     6864    14870    97.86±0.4   97.94±0.3   97.94±0.3   88.99±70.8      161.45±162.6    23.76±13.7    -95.215 
classic    5675    41681    96.93±0.6   97.33±0.5   97.38±0.5   3.44±3.8        31.60±11.7      17.44±7.6     -418.789 
ohscal     8929    11465    87.05±0.6   87.99±0.6   89.27±0.6   320.39±134.5    44.78±21.6      19.13±25.1    -85.724 
real-sim   57847   20958    95.16±0.3   96.28±0.2   96.05±0.2   63.81±96.3      382.70±813.1    23.14±9.3     -105.902 



Figure 1. Example of a non-convex smooth transductive loss function $T(\cdot)$ obtained with $\tau = 1$ as well as its components: (left) $g_1(u)$, (middle) $g_2(u)$, 
(right) DC decomposition of $T(u)$. 


The transductive loss $T(\cdot)$ as well as $g_1$ and $g_2$ and their 
components are illustrated in Figure 1. 

According to this definition of the transductive loss, for our 
experiments, we have used the following loss involving all 
training examples 

$$f(x) = \sum_{i=1}^{\ell} \log\left(1 + \exp(-y_i a_i^\top x)\right) + \gamma \sum_{j=1}^{\ell_u} T(b_j^\top x) \qquad (19)$$

$\{a_i, y_i\}_{i=1}^{\ell}$ being the labeled examples, $\{b_j\}_{j=1}^{\ell_u}$ the unlabeled 
ones, and $\gamma$ a hyperparameter that balances the weight 
of both losses. As previously, the capped-$\ell_1$ penalty serves as a 
regularizer. 
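For concreteness, the smooth part of the objective in Equation (19) can be evaluated as in the sketch below. The array names `A`, `y`, `B` (labeled inputs, labels in $\{-1,+1\}$, unlabeled inputs) are hypothetical, the $1/\tau$ scaling inside $T$ is an assumption, and the capped-$\ell_1$ regularizer is deliberately left out.

```python
import numpy as np

def logistic(u):
    """log(1 + exp(-u)), evaluated stably."""
    return np.logaddexp(0.0, -u)

def transductive(u, tau=1.0):
    """Symmetric smooth loss T(u); the 1/tau scaling is an assumption."""
    return (1.0
            - (logistic(u) - logistic(u + tau)) / tau
            - (logistic(-u) - logistic(-u + tau)) / tau)

def smooth_objective(x, A, y, B, gamma, tau=1.0):
    """Labeled logistic loss plus gamma times the transductive loss on unlabeled data."""
    labeled = logistic(y * (A @ x)).sum()
    unlabeled = transductive(B @ x, tau).sum()
    return labeled + gamma * unlabeled

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 4))      # 10 labeled examples in R^4
y = rng.choice([-1, 1], size=10)
B = rng.standard_normal((30, 4))      # 30 unlabeled examples
value_at_zero = smooth_objective(np.zeros(4), A, y, B, gamma=0.01)
```

At $x = 0$ every labeled term equals $\log 2$ and every unlabeled term equals $T(0)$, which gives a quick way to check an implementation.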

1) Toy dataset: In order to illustrate the benefit of our 
sparse transductive approach, we have considered the same 
toy dataset as in the previous subsection and the same 
experimental protocol. However, we have considered only 5 relevant 
variables, and sampled 100 training examples and 5000 testing 
examples. In addition, we have considered 10000 unlabeled 
examples. The total number of variables varies. We have 
compared the recognition performance of 3 algorithms: the 
above-described capped-$\ell_1$ sparse logistic regression, the 
non-sparse transductive SVM (TSVM) of [48]* and our sparse 
transductive logistic regression. 

* We used the code available on the author's website. 

The evolution of the recognition rate of these algorithms with 
respect to the number of variables in the learning problem 
is depicted in Figure 2. Interestingly, when the number of 
variables is small enough, all algorithms perform equivalently. 
Then, as the number of (noisy) variables increases, the 
transductive SVM suffers a drop in performance. It seems more 
beneficial in this case to consider a model that is able to select 
relevant variables, as our capped-$\ell_1$ sparse logistic regression 
still performs well. The best performances are obtained using our 
DC formulation introduced for solving the sparse transductive 
logistic regression problem, which is able to remove noisy 
variables and take advantage of the unlabeled examples. 

Figure 2. Recognition rate of different algorithms that are either sparse, 
transductive or both with respect to the number of variables in the problem, 
the number of relevant variables being 5. 

2) Benchmark datasets: We have also analyzed the 
benefit of using unlabeled examples in high-dimensional 
learning problems. For this experiment, all the hyperparameters 
of all models have been cross-validated. For instance, 
$\lambda$, $\theta$ (parameters of the capped-$\ell_1$) and $\gamma$ have been 
respectively searched among the sets $\{0.2, 2\}$, $\{0.2, 2\}$ and 
$\{0.005, 0.001, 0.005, 0.01\}$. Averaged results over 10 trials are 
reported in Table IV. Note that the results of the transductive 
SVM of [48] have not been reported because the provided 
code was not able to provide a solution in a reasonable amount 
of time. Results in Table IV show that being able to handle 
non-convex loss functions, related to the transductive loss, and 
non-convex sparsity-inducing regularizers helps in achieving 
better accuracy. Again, we can remark that the 
benefits of unlabeled examples are compelling, especially when 
few labeled examples are available. Differences in performance 
are indeed statistically significant for most datasets. In order 
to further evaluate the accuracy of the proposed method in a 
very high-dimensional setting, we have run the comparison on 
the URL dataset. This dataset involves about $3 \times 10^6$ features 
and we have learned a decision function using only 1000 
training examples and 40000 unlabeled examples. Although the 
difference in performance is not significant, leveraging 
unlabeled examples helps in improving accuracy. Note that 
for this problem, the average running times of our DC-based 
sparse logistic regression and the DC-based sparse transductive 
regression are respectively about 500 and 700 seconds. This 
shows that the proposed approach can handle large-scale 
and very high-dimensional learning problems. 

Table IV 

Comparing the recognition rate of a sparse logistic 
regression and a sparse transductive logistic regression, both 
with capped-$\ell_1$ regularizer. $\ell$ and $\ell_u$ respectively denote the 
number of labeled and unlabeled examples. 

                                              Class. Rate (%) 
dataset    d            $\ell$    $\ell_u$    Sparse Log    Sparse Transd. 
la2        31472        61        2398        67.65±2.6     70.23±3.1 
sports     14870        85        6778        81.26±5.0     88.15±4.4 
classic    41681        70        5604        72.74±4.3     86.97±2.2 
ohscal     11465        55        8873        70.35±2.4     73.39±3.6 
real-sim   20958        723       57124       88.81±0.3     88.91±1.4 
url        3.23x10^6    1000      40000       86.64±5.8     87.39±6.0 

For the sake of reproducible research, the source code of 
the numerical simulations will be freely available on the authors' 
website. 

V. Conclusions 

This paper introduced a general proximal Newton algorithm 
that optimizes a composite sum of functions. A specificity 
of the approach is its ability to deal with the non-convexity of 
both terms, while one of these terms is in addition allowed 
to be non-differentiable. While most works in the 
machine learning and optimization communities have 
addressed these non-differentiability and non-convexity issues 
separately, there exist a number of learning problems, such as 
sparse transductive learning, that require efficient optimization 
schemes for non-convex and non-differentiable functions. Our 
algorithm is based on two steps: the first one looks for a 
search direction through a proximal Newton step while the 
second one performs a line search along that direction. We also 
provide in this work the proof that the iterates generated by 
this algorithm behave correctly, in the sense that limit points 
of the sequence are stationary points. Numerical experiments 
show that the second-order information used in our algorithm 
through the matrix $H_k$ allows faster convergence than proximal 
gradient based descent approaches for non-convex 
regularizers. One of the strengths of our framework is its ability to 
handle non-convexity in both the smooth loss function and 
the regularizer. We have illustrated this ability by learning a 
sparse transductive logistic regression model. 


VI. Appendix 

A. Details on the proximal expression of $\Delta x_k$ 

We provide in this paragraph the steps for obtaining 
Equation (7) from Equation (6). 

Recall that for a lower semi-continuous convex function 
$h_1$, the proximal operator is defined as [15] 

$$y^\star = \operatorname{prox}^{H}_{h_1}(x) = \arg\min_{y} \frac{1}{2}\|y - x\|_H^2 + h_1(y)$$

$y^\star$ can be characterized by the optimality condition of the 
optimization problem, which is 

$$-H(y^\star - x) \in \partial h_1(y^\star)$$

The search direction is provided by Equation (6), which we 
recall is 

$$\arg\min_{\Delta x} \frac{1}{2}\Delta x^\top H_k \Delta x + h_1(x_k + \Delta x) + v_k^\top \Delta x$$

By posing $z = x_k + \Delta x$, we can equivalently look at a shifted 
version of this problem: 

$$z_k = \arg\min_{z} \frac{1}{2}(z - x_k)^\top H_k (z - x_k) + h_1(z) + v_k^\top (z - x_k)$$

The optimality condition of this problem is 

$$-H_k\left(z_k - (x_k - H_k^{-1} v_k)\right) \in \partial h_1(z_k)$$

Hence, according to the optimality condition of the proximal 
operator, we have 

$$z_k = \operatorname{prox}^{H_k}_{h_1}\left(x_k - H_k^{-1} v_k\right)$$

and thus 

$$\Delta x_k = \operatorname{prox}^{H_k}_{h_1}\left(x_k - H_k^{-1} v_k\right) - x_k$$

which is Equation (7). 
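To make this expression concrete, the sketch below computes the search direction in one special case where the scaled proximal operator has a closed form: a diagonal matrix $H_k$ and $h_1 = \lambda\|\cdot\|_1$. Both choices are assumptions made for illustration (the algorithm itself allows a general positive definite $H_k$), and the function names are hypothetical. In this case the prox reduces to a coordinate-wise soft-thresholding with threshold $\lambda / H_{ii}$.

```python
import numpy as np

def prox_l1_diag(w, h_diag, lam):
    """argmin_y 0.5 * ||y - w||_H^2 + lam * ||y||_1 for H = diag(h_diag), h_diag > 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lam / h_diag, 0.0)

def prox_newton_direction(x, v, h_diag, lam):
    """Delta x = prox^H_{h1}(x - H^{-1} v) - x, specialised to diagonal H and l1 penalty."""
    z = prox_l1_diag(x - v / h_diag, h_diag, lam)
    return z - x

rng = np.random.default_rng(0)
x = rng.standard_normal(6)
v = rng.standard_normal(6)            # stands for the linear term v_k
h_diag = rng.uniform(0.5, 2.0, 6)     # diagonal of a positive definite H_k
dx = prox_newton_direction(x, v, h_diag, lam=0.3)

# Descent property: v^T dx + h1(x + dx) - h1(x) <= -dx^T H dx.
D = v @ dx + 0.3 * (np.abs(x + dx).sum() - np.abs(x).sum())
print(D <= -(dx * h_diag * dx).sum() + 1e-10)
```

For a non-diagonal $H_k$ the scaled proximal operator generally has no closed form and has to be computed by an inner iterative solver.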

B. Lemma 3 and proof 

Lemma 3: For $x$ in the domain of $f$, and assuming that 
$H_k \succeq mI$ with $m > 0$ and that $\nabla f_1$ is Lipschitz with constant $L$, 
the sufficient descent condition in Equation (17) holds for all $t_k$ 
such that 

$$t_k \le \min\left(1, \frac{2m(1-\alpha)}{L}\right)$$

Proof: Recall that $x_+ := x_k + t_k \Delta x_k$; the iteration index $k$ is dropped in what follows. By definition, we have 

$$F(x_+) - F(x) = f_1(x_+) - f_1(x) - f_2(x_+) + f_2(x) + h_1(x_+) - h_1(x) - h_2(x_+) + h_2(x).$$

Then by convexity of $f_2$, $h_2$ and $h_1$, we derive that (see 
Equation (14)) 

$$F(x_+) - F(x) \le f_1(x_+) - f_1(x) + (1 - t)h_1(x) + t\,h_1(x + \Delta x) - (z_{f_2} + z_{h_2})^\top (t\Delta x) - h_1(x)$$

According to a Taylor-Laplace formulation, we have 

$$f_1(x_+) - f_1(x) = \int_0^1 \nabla f_1(x + st\Delta x)^\top (t\Delta x)\, ds$$

thus, we can rewrite 

$$\begin{aligned}
F(x_+) - F(x) \le{}& \int_0^1 \nabla f_1(x + st\Delta x)^\top (t\Delta x)\, ds - t\,h_1(x) + t\,h_1(x + \Delta x) - (z_{f_2} + z_{h_2})^\top (t\Delta x) \\
\le{}& \int_0^1 \left(\nabla f_1(x + st\Delta x) - \nabla f_1(x)\right)^\top (t\Delta x)\, ds + t\,h_1(x + \Delta x) + \nabla f_1(x)^\top (t\Delta x) - (z_{f_2} + z_{h_2})^\top (t\Delta x) - t\,h_1(x) \\
\le{}& t\left(\int_0^1 \left(\nabla f_1(x + st\Delta x) - \nabla f_1(x)\right)^\top \Delta x\, ds + h_1(x + \Delta x) + \nabla f_1(x)^\top \Delta x - (z_{f_2} + z_{h_2})^\top \Delta x - h_1(x)\right)
\end{aligned}$$

Then, using the Cauchy-Schwarz inequality and the fact that $f_1$ 
is gradient Lipschitz with constant $L$, we have, with $v = \nabla f_1(x) - z_{f_2} - z_{h_2}$: 

$$\begin{aligned}
F(x_+) - F(x) \le{}& t\left(\int_0^1 stL\|\Delta x\|_2^2\, ds + h_1(x + \Delta x) + \nabla f_1(x)^\top \Delta x - (z_{f_2} + z_{h_2})^\top \Delta x - h_1(x)\right) \\
\le{}& t\left(\frac{tL}{2}\|\Delta x\|_2^2 + h_1(x + \Delta x) + \nabla f_1(x)^\top \Delta x - (z_{f_2} + z_{h_2})^\top \Delta x - h_1(x)\right) \\
\le{}& t\left(\frac{tL}{2}\|\Delta x\|_2^2 + h_1(x + \Delta x) - h_1(x) + v^\top \Delta x\right) \\
\le{}& t\left(\frac{tL}{2}\|\Delta x\|_2^2 + D\right)
\end{aligned}$$

Now, if $t$ is such that 

$$t \le \frac{2m(1-\alpha)}{L}$$

then 

$$\begin{aligned}
\frac{tL}{2}\|\Delta x\|_2^2 \le{}& m(1-\alpha)\|\Delta x\|_2^2 \\
={}& (1-\alpha)\Delta x^\top (mI)\Delta x \\
\le{}& (1-\alpha)\Delta x^\top H \Delta x \\
\le{}& -(1-\alpha)D
\end{aligned}$$

where the last inequality comes from the descent property. 
Now, we plug this inequality back and get 

$$t\left(\frac{tL}{2}\|\Delta x\|_2^2 + D\right) \le t\left(-(1-\alpha)D + D\right) = t\alpha D$$

which concludes the proof that for all 

$$t \le \min\left(1, \frac{2m(1-\alpha)}{L}\right)$$

we have 

$$F(x_+) - F(x) \le t\alpha D$$

C. Convergence property for F satisfying the KL property 

Proposition 1 provides the general convergence property 
of our algorithm that applies to a large class of functions. 
Stronger convergence properties (for instance, the convergence 
of the sequence $\{x_k\}$ to a stationary point of $F(x)$) can be 
attained by restricting the class of functions and by imposing 
further conditions on the algorithm and some of its 
parameters. For instance, by considering functions $F(x)$ that satisfy 
the so-called Kurdyka-Łojasiewicz property, convergence of 
the sequence can be established. 

Proposition 2: Assume the following: 

• the hypotheses on $f$ and $h$ given in Section II are satisfied, 

• $h_1$ is continuous and defined over $\mathbb{R}^d$, 

• $H_k$ is such that $H_k \succeq mI$ for all $k$ and $m > 0$, 

• $F$ is coercive and it satisfies the Kurdyka-Łojasiewicz 
property, 

• $h_2$ verifies the $L_{h_2}$-Lipschitz gradient property, and thus 
there exists a constant $L_{h_2}$ such that 

$$\|u - v\|_2 \le L_{h_2}\|x - y\|_2 \quad \forall u \in \partial h_2(x) \text{ and } v \in \partial h_2(y)$$

• at each iteration, $H_k$ is such that the function $\tilde{f}_1(z, x_k) = 
f_1(x_k) + \nabla f_1(x_k)^\top (z - x_k) + \frac{1}{2}\|z - x_k\|_{H_k}^2$ is a majorant 
approximation of $f_1(\cdot)$, i.e. 

$$f_1(z) \le \tilde{f}_1(z, x_k) \quad \forall z$$

• there exists an $\alpha \in (0, 1]$ such that at each iteration the 
condition 

$$F(x_{k+1}) \le (1 - \alpha)F(x_k) + \alpha F(z_k)$$

holds. Here, $z_k$ is equal to $x_k + \Delta x_k$ as defined in 
Appendix A. 

Under the above assumptions, the sequence $\{x_k\}$ generated 
by our Algorithm 1 converges to a critical point of $F = f + h$. 

Before stating the proof, let us note that these conditions 
are quite restrictive and thus may limit the scope of the 
convergence property. For instance, the hypothesis on $h_2$ holds 
for the SCAD regularizer but does not hold for the capped-$\ell_1$ 
penalty. We thus leave for future work the development of 
an adaptation of this proximal Newton algorithm for which 
convergence of the sequence $\{x_k\}$ holds for a larger class of 
regularizers and loss functions. 

Proof: The proof of convergence of the sequence $\{x_k\}$ strongly 
relies on Theorem 4.1 in [40]. Basically, this theorem states 
that sequences $\{x_k\}$ generated by an algorithm minimizing a 
function $F = f + h$, with $h$ being convex and $F$ satisfying the 
Kurdyka-Łojasiewicz property, converge to a stationary point 
of $F$ under the above assumptions. The main difference 
between our framework and the one in [40] is that we 
consider a non-convex function $h(x)$. Hence, for the sake of 
brevity, we give in what follows only the parts of 
the proofs given in [40] that need to be reformulated due 
to the non-convexity of $h(x)$. 


i) Sufficient decrease property. This property provides a guarantee 
similar to Lemma 4.1 in [40]. It easily derives 
from Equations (17) and (15). Combining these two equations 
tells us that 

$$F(x_{k+1}) - F(x_k) \le -\alpha t_k \Delta x^\top H_k \Delta x$$

where by definition, we have $x_{k+1} = x_k + t_k\Delta x$. Thus, we 
get 

$$\begin{aligned}
F(x_{k+1}) - F(x_k) \le{}& -\frac{\alpha}{t_k}\|x_{k+1} - x_k\|_{H_k}^2 \\
\le{}& -\frac{\alpha m}{t_k}\|x_{k+1} - x_k\|_2^2 \\
\le{}& -\alpha m\|x_{k+1} - x_k\|_2^2
\end{aligned}$$

which proves that a sufficient decrease occurs at each iteration 
of our algorithm. In addition, because $x_{k+1} - x_k = t_k\Delta x = 
t_k(z_k - x_k)$, we also have 

$$F(x_{k+1}) \le F(x_k) - \alpha m \underline{t}\,\|z_k - x_k\|_2^2 \qquad (20)$$

where $\underline{t}$ is the smallest $t_k$ we may encounter. According to 
Lemma 3, we know that $\underline{t} > 0$. 

ii) Convergence of $F(z_k)$. Recall that we have defined $z_k$ 
as (see Appendix A) 

$$z_k = \arg\min_{z} \frac{1}{2}(z - x_k)^\top H_k (z - x_k) + h_1(z) + v_k^\top (z - x_k)$$

which is equivalent, by expanding $v_k$ and adding constant 
terms, to 

$$\begin{aligned}
\min_{z}\ & \frac{1}{2}(z - x_k)^\top H_k (z - x_k) + f_1(x_k) + \nabla f_1(x_k)^\top (z - x_k) \\
& - f_2(x_k) - \nabla f_2(x_k)^\top (z - x_k) \\
& - h_2(x_k) - \partial h_2(x_k)^\top (z - x_k) \\
& + h_1(z)
\end{aligned}$$

Note that the terms in the first line of this minimization 
problem majorize $f_1$ by hypothesis and the terms in the second 
and third lines respectively majorize $-f_2$ and $-h_2$ since 
these are concave functions. Denoting by $Q(z, x_k)$ the 
objective function of the above problem, we have 

$$F(z_k) \le Q(z_k, x_k) \le Q(x_k, x_k) = F(x_k) \qquad (21)$$

where the first inequality holds because $Q(z, x_k)$ majorizes 
$F(z)$ and the second one holds owing to the minimization. 
Combining this last equation with the assumption on $F(x_{k+1})$, we 
have 

$$\alpha^{-1}\left(F(x_{k+1}) - (1 - \alpha)F(x_k)\right) \le F(z_k) \le F(x_k)$$

This last equation allows us to conclude that if $F(x_k)$ 
converges to a real value, then $F(z_k)$ converges to the same value. 

iii) Bounding the subgradient at $F(z_k)$. 

A subgradient $z_k^F$ of $F$ at a given $z_k$ is by definition 

$$z_k^F = \nabla f_1(z_k) - \nabla f_2(z_k) + z_{h_1,z_k} - z_{h_2,z_k}$$

where $z_{h_1,z_k} \in \partial h_1(z_k)$ and $z_{h_2,z_k} \in \partial h_2(z_k)$. Hence, we 
have 

$$\begin{aligned}
\|z_k^F\| \le{}& \|\nabla f_1(z_k) - \nabla f_1(x_k)\| + \|\nabla f_2(z_k) - \nabla f_2(x_k)\| \\
& + \|z_{h_2,z_k} - z_{h_2,x_k}\| \\
& + \|\nabla f_1(x_k) - \nabla f_2(x_k) + z_{h_1,z_k} - z_{h_2,x_k}\|
\end{aligned}$$

In addition, owing to the optimality condition of $z_k$, the 
following holds 

$$-H_k(z_k - x_k) = \nabla f_1(x_k) - \nabla f_2(x_k) + z_{h_1,z_k} - z_{h_2,x_k}$$

Hence, owing to the Lipschitz gradient hypothesis on $f_1$ and 
$f_2$ and the hypothesis on $h_2$, there exists a constant $\mu > 0$ 
such that 

$$\|z_k^F\| \le \mu\|z_k - x_k\| \qquad (22)$$